Background

Climate change is one of the biggest challenges facing humanity today, resulting, among other effects, in more frequent and severe weather events. The Federal Emergency Management Agency (FEMA), part of the U.S. Department of Homeland Security, is responsible for coordinating the response to disasters and emergencies, both natural and man-made. FEMA supports state and local governments, as well as individuals affected by disasters, by offering financial assistance, providing temporary housing, and assisting with recovery efforts. The agency also plays a key role in disaster preparedness and risk reduction through programs that help communities develop plans and take steps to reduce the impact of future disasters.

Description of data set(s)

FEMA provides various data sets on past disasters, some going back to the 1960s. For our project we merged two data sets from FEMA and enriched them with additional information from Wikipedia, such as geographical data (longitude, latitude) and population size per state. The first FEMA data set contains, for example, the type of disaster (fire, flooding, etc.), the state in which the disaster occurred, and its start and end dates. From the second data set we pulled the amount of monetary damage each disaster caused. Since the second data set only goes back to 2002, we restricted our analysis to the past 20 years.
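As a minimal sketch of the merge step (the column names here are hypothetical, not the actual FEMA field names), the two exports can be joined on a shared disaster number:

```r
# Toy versions of the two FEMA exports; DisasterNumber is assumed to be the
# shared key (the real field names may differ).
declarations <- data.frame(DisasterNumber = c(1971, 4655, 4637),
                           Type  = c("Severe Storm", "Flood", "Tornado"),
                           State = c("Alabama", "Montana", "Tennessee"))
damages <- data.frame(DisasterNumber = c(1971, 4655, 4637),
                      Damage = c(468614.48, 2901.34, 7753.23))

# Inner join on the key; rows without a match in both tables would be dropped
merged <- merge(declarations, damages, by = "DisasterNumber")
```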

After merging and cleaning the data from the four sources, our final data set to do the analysis on looked like this:

| Disaster# | Type | County | State | Latitude | Longitude | Damage | Approved | Repair | Population | IncomeCapita | DisasterBegin | DisasterEnd | Duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1971 | Severe Storm | Marshall | Alabama | 34.3670 | -86.3066 | 468614.48 | 116003.84 | 101089.45 | 96137 | 20382 | 2011-04-15 | 2011-05-31 | 46 days |
| 4655 | Flood | Park | Montana | 45.4884 | -110.5267 | 2901.34 | 12532.99 | 11078.99 | 16513 | 24611 | 2022-06-10 | 2022-07-05 | 25 days |
| 4637 | Tornado | Cheatham | Tennessee | 36.2612 | -87.0868 | 7753.23 | 14653.23 | 6344.23 | 40539 | 23459 | 2021-12-10 | 2021-12-11 | 1 days |
| 4633 | Tornado | Jackson | Arkansas | 35.5992 | -91.2145 | 199.50 | 199.50 | 0.00 | 16908 | 15609 | 2021-12-10 | 2021-12-11 | 1 days |
| 4630 | Tornado | Lyon | Kentucky | 37.0191 | -88.0833 | 0.00 | 0.00 | 0.00 | 8226 | 22123 | 2021-12-10 | 2021-12-11 | 1 days |
| 4618 | Hurricane | Bucks | Pennsylvania | 40.3370 | -75.1069 | 92390.42 | 52031.76 | 42999.73 | 627668 | 37466 | 2021-08-31 | 2021-09-05 | 5 days |

Exploratory Data Analysis

Relative frequency of disaster types

Displaying the cumulative frequencies of each disaster type relative to each other, the stacked bar chart shows that “severe storms” are by far the most common disaster type, followed by hurricanes and biological disasters.
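The relative frequencies behind such a chart can be computed directly; a small sketch with made-up counts:

```r
# Made-up disaster types purely for illustration
types <- c("Severe Storm", "Severe Storm", "Severe Storm",
           "Hurricane", "Hurricane", "Biological")

# Tabulate, convert counts to proportions, sort most common first
rel_freq <- sort(prop.table(table(types)), decreasing = TRUE)
rel_freq
```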

Distribution of disaster types over time

Various insights can be gained from the time plot below. For example, severe storms happen quite frequently in general, except around 2013-2015 and 2017-2018. Furthermore, only one volcanic eruption and one severe ice storm were recorded in the displayed time period, and no dam or levee breaks or biological disasters occurred between 2000 and 2023. We could hypothesize that fires lasted longer before about 2013 than afterwards, perhaps because more efficient ways to extinguish them were developed. However, the plot shows the durations of almost 500 disasters, which may overlap. Since using 500 different colors, one per disaster, would not help much, we would switch to a different way of showing this information, for example an interactive graphic built with Shiny.

Duration per disaster type

Exploring the disaster types a little further, the boxplot below shows that fires have the largest variation in duration. The disaster type with the largest median duration is volcanic eruptions. However, this observation should be interpreted with care, since there appears to have been only one such incident in the analysed time frame (only the median is shown with no visible variation, which indicates a single data point in this group). Outliers with very long durations can be seen for floods, hurricanes, severe storms, and tornados.

Damage per state

To analyse the data on damage, we first took a look at the distribution of the data points. The histogram indicated a right-skewed distribution.

Also, the differences between the individual damage amounts are quite large, which makes the data hard to work with. Therefore, we took the logarithm of the data, which resulted in an approximately normal distribution:
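The transformation itself is a one-liner; note that zero damage amounts, which do occur in our data, become -Inf under the logarithm and have to be filtered out later. A sketch with toy values:

```r
damage <- c(0, 199.50, 2901.34, 7753.23, 92390.42, 468614.48)  # toy amounts
log_damage <- log(damage)

log_damage[1]                                  # log(0) is -Inf
finite <- log_damage[is.finite(log_damage)]
range(finite)                                  # huge spread shrinks to a few log units
```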

With the transformed data, we could generate a plot showing the damage disasters have caused per state in the US over the past 20 years:


Population per state

The following map shows the population per state. Similar to above, we are using the log-transformed data.

Actual Damage vs. approved amount

How much do the estimated damage and the approved amount differ? As before, we worked with log-transformed data. The overlapping histograms show a few instances where the approved amount is much lower than the damage that actually occurred; vice versa, there are instances where the approved amount exceeded the actual damage.
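A minimal way to draw such overlapping histograms in base R; the log amounts here are simulated, since the plot itself is not reproduced at this point:

```r
set.seed(1)
log_damage   <- rnorm(300, mean = 10, sd = 2)   # simulated log damage
log_approved <- rnorm(300, mean = 9.5, sd = 2)  # simulated log approved amount

# Semi-transparent fill colors make the overlap between the two groups visible
hist(log_damage, col = rgb(1, 0, 0, 0.4), xlim = c(2, 18),
     main = "Damage vs. Approved (log)", xlab = "log amount")
hist(log_approved, col = rgb(0, 0, 1, 0.4), add = TRUE)
legend("topright", legend = c("Damage", "Approved"),
       fill = c(rgb(1, 0, 0, 0.4), rgb(0, 0, 1, 0.4)))
```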

Approved and actually used (repair)

When comparing the amount of money that was approved to repair the damage with the actual costs spent on reconstructions, we can see that in almost all cases, less money was spent than approved.

Diving into this in a bit more detail, we can distinguish between disaster types. For some, like tornados or severe ice storms, the approved amounts exceed the spent amounts far more than for others, like earthquakes or floods, where the two amounts are much closer.

boxplot.repair.df <- data.frame(ApprovedLog = log(df.prep.5$Approved),
                                RepairLog = log(df.prep.5$Repair),
                                Type = df.prep.5$Type)

# drop rows with -Inf values from log(0); if_all() is the current dplyr idiom
# replacing the superseded filter_if()
boxplot.repair.num.df <- boxplot.repair.df %>%
  filter(if_all(where(is.numeric), ~ !is.infinite(.x)))

b.rep.inf <- boxplot.repair.num.df %>%
  pivot_longer(ApprovedLog:RepairLog, names_to = "Names", values_to = "values") %>%
  ggplot(aes(y = values, x = Type, fill = Names))+
  geom_boxplot() + 
  facet_wrap(~Type, scale="free")+
  ggtitle('Approved and Repair Amount log-transformed without Inf Rows')
b.rep.inf + theme(
  plot.title = element_text(size=12, face="bold.italic"),
  axis.text.x=element_blank())

b.rep <- boxplot.repair.df %>%
  pivot_longer(ApprovedLog:RepairLog, names_to = "Names", values_to = "values") %>%
  ggplot(aes(y = values, x = Type, fill = Names))+
  geom_boxplot() +
  facet_wrap(~Type, scale="free")+
  ggtitle('Approved and Repair Amount log-transformed with Inf Rows')
b.rep+
  theme(
  plot.title = element_text(color = 'red',size=12, face="bold.italic"),
  axis.text.x=element_blank())

# ggarrange(b.rep.inf, b.rep, 
#           labels = c("With Infinite Numbers", "Without Infinite Number"),
#           ncol = 1, nrow = 2)

Shiny Dashboard for EDA

The Shiny dashboard provides an interactive way to further inspect aspects of the natural disasters and where they occur. https://milicapajkic.shinyapps.io/Dashboard/

This dashboard is helpful for interactively exploring which disaster occurred in which state or county. The disaster type needs to be selected first, and only the states and counties in which these disasters occurred are “clickable”. For example, if we are interested in tornados, we can see that this disaster type only occurred in 13 states, predominantly in the Southern (e.g. Georgia) and Midwestern states.

Duration of Disaster Type by State and County: the data used for this plot is not really suitable for a heatmap, but we wanted to try it out for learning purposes. The reason is that not all counties are included in the data frame, because not every county appears in the FEMA records; as a consequence, our income per capita and population data were also reduced.

Model

In this chapter we want to see whether the amount of damage can be predicted from the given data points.

H1: With a higher income per capita, the damage will be lower.

The reasoning is that with a higher income per capita, preparation and precaution for natural disasters are better; for one, there is no monetary strain as in counties with a lower income per capita.

H2: The amount of damage depends on which type of natural catastrophe occurs.

First, we need to look at the distribution of the important variables. It appears that damage and income per capita need to be log-transformed, so we applied the logarithm to both.

The scatter plot indicates that there is a positive relationship between income per capita and damage.

To see whether the chosen variables have an influence on damage, we performed model-term tests with drop1(). The full model indicates that only log.income, Type, and Duration have relevant effects on the dependent variable damage.

Inspection

As seen in the previous chapter, damage and income per capita need to be log-transformed. The scatter plot before the transformation is very one-sided; with the log-transformed data, the output and the linear regression plot show a slight positive relationship.

ggplot(data = df.prep.5, 
       mapping = aes(x = IncomeCapita, fill = '')) +
  geom_histogram(show.legend = FALSE)+
  labs(x = 'Income', y = 'Count') +
  scale_fill_manual(values = c('lightblue2'))+
  theme(axis.text=element_text(size=12),
        axis.title=element_text(size=18,face="bold"))+
  ggtitle("Histogram of Income per Capita") +
  theme(plot.title = element_text(size = 20, face = "bold"))

ggplot(data = df.prep.5, 
       mapping = aes(x = log(IncomeCapita), fill = '')) +
  geom_histogram(show.legend = FALSE)+
  labs(x = 'log. Income', y = 'Count') +
  scale_fill_manual(values = c('mistyrose3'))+
  theme(axis.text=element_text(size=12),
        axis.title=element_text(size=18,face="bold"))+ 
  ggtitle("Histogram of Income per Capita (log)") +
  theme(plot.title = element_text(size = 20, face = "bold"))

plot(df.prep.5$IncomeCapita, df.prep.5$Damage, 
     main = "Scatter Plot of Damage and Income per Capita",
     xlab = "Income per Capita", ylab = "Damage")

plot(log(df.prep.5$IncomeCapita), log(df.prep.5$Damage), 
     main = "Scatter Plot of Logarithmic Damage and Logarithmic Income per Capita",
     xlab = "Logarithmic Income per Capita", ylab = "Logarithmic Damage")

# note: income on the x-axis and damage on the y-axis, matching the axis labels
ggplot(df.prep.5, aes(log(IncomeCapita), log(Damage))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = 'log. Income', y = 'log. Damage')+
  theme(axis.text=element_text(size=12),
        axis.title=element_text(size=18,face="bold"))+ 
  ggtitle("Scatter Plot: Damage (log) against Income per Capita (log)") +
  theme(plot.title = element_text(size = 20, face = "bold"))

Linear Regression

df.model <- df.prep.5
df.model$Type <- relevel(df.model$Type, ref = "Dam/Levee Break")
df.model$log.damage <- log(df.model$Damage)
df.model$log.income <- log(df.model$IncomeCapita)
# 
df.model <- df.model[!is.na(df.model$log.damage) & is.finite(df.model$log.damage), ]
# 
model <- lm(log.damage ~ log.income, data = na.omit(df.model))
summary(model)
## 
## Call:
## lm(formula = log.damage ~ log.income, data = na.omit(df.model))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2833 -1.7712 -0.1211  1.6129  8.7048 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -3.5667     5.2192  -0.683   0.4949  
## log.income    1.2822     0.5184   2.473   0.0139 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.421 on 309 degrees of freedom
## Multiple R-squared:  0.01941,    Adjusted R-squared:  0.01624 
## F-statistic: 6.117 on 1 and 309 DF,  p-value: 0.01392

Model with all the variables

In this model we fit all the variables and afterwards apply drop1() to see which variables have a significant influence on the dependent variable. We then drop the variables without a significant influence on damage. The final model includes income per capita, type of disaster, and duration.

model.full <- lm(log.damage ~ log.income + State+ Type+ Population+ Duration, data = na.omit(df.model))
summary(model.full)
## 
## Call:
## lm(formula = log.damage ~ log.income + State + Type + Population + 
##     Duration, data = na.omit(df.model))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.364 -1.379  0.000  1.358  6.059 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)  
## (Intercept)           -4.882e+00  7.277e+00  -0.671   0.5029  
## log.income             1.451e+00  7.046e-01   2.060   0.0405 *
## StateAlaska           -1.311e+00  1.723e+00  -0.761   0.4473  
## StateArkansas          9.686e-01  9.610e-01   1.008   0.3145  
## StateCalifornia        8.646e-01  1.401e+00   0.617   0.5377  
## StateColorado         -1.704e+00  1.760e+00  -0.969   0.3337  
## StateConnecticut      -4.361e-01  1.397e+00  -0.312   0.7551  
## StateDelaware         -1.544e+00  2.424e+00  -0.637   0.5246  
## StateFlorida           1.421e-01  8.456e-01   0.168   0.8667  
## StateGeorgia          -8.184e-01  9.437e-01  -0.867   0.3867  
## StateHawaii           -4.165e-01  1.550e+00  -0.269   0.7884  
## StateIllinois          3.577e-01  9.325e-01   0.384   0.7016  
## StateIndiana          -3.766e-01  9.327e-01  -0.404   0.6867  
## StateIowa             -4.459e-01  1.331e+00  -0.335   0.7379  
## StateKansas            4.305e-02  1.475e+00   0.029   0.9767  
## StateKentucky         -9.493e-02  8.535e-01  -0.111   0.9115  
## StateLouisiana         1.670e-01  8.970e-01   0.186   0.8524  
## StateMaine            -1.158e+00  1.760e+00  -0.658   0.5111  
## StateMaryland          1.438e+00  1.782e+00   0.807   0.4205  
## StateMassachusetts    -9.197e-01  1.255e+00  -0.733   0.4645  
## StateMichigan          4.720e-01  1.504e+00   0.314   0.7539  
## StateMinnesota        -1.209e+00  1.336e+00  -0.905   0.3662  
## StateMississippi       9.258e-01  7.989e-01   1.159   0.2476  
## StateMissouri         -3.151e-01  9.981e-01  -0.316   0.7525  
## StateMontana          -1.686e+00  1.797e+00  -0.938   0.3492  
## StateNebraska         -2.212e+00  1.487e+00  -1.487   0.1382  
## StateNevada            5.371e+00  2.382e+00   2.254   0.0250 *
## StateNew Hampshire    -1.793e+00  1.353e+00  -1.325   0.1864  
## StateNew Jersey       -8.625e-01  1.095e+00  -0.788   0.4317  
## StateNew Mexico        5.113e+00  2.382e+00   2.147   0.0328 *
## StateNew York         -1.380e+00  9.197e-01  -1.501   0.1346  
## StateNorth Carolina   -2.630e-01  1.084e+00  -0.243   0.8086  
## StateNorth Dakota     -7.434e-01  1.839e+00  -0.404   0.6863  
## StateOhio             -4.104e-02  1.001e+00  -0.041   0.9673  
## StateOklahoma         -1.962e-01  9.490e-01  -0.207   0.8364  
## StateOregon           -5.421e-02  1.755e+00  -0.031   0.9754  
## StatePennsylvania     -7.499e-03  1.166e+00  -0.006   0.9949  
## StateRhode Island      2.189e-01  2.450e+00   0.089   0.9289  
## StateSouth Carolina    1.276e+00  1.321e+00   0.965   0.3353  
## StateSouth Dakota     -1.277e+00  1.369e+00  -0.933   0.3518  
## StateTennessee        -7.755e-01  9.151e-01  -0.847   0.3975  
## StateTexas            -1.098e-01  8.562e-01  -0.128   0.8980  
## StateVermont          -1.653e+00  2.395e+00  -0.690   0.4907  
## StateVirginia         -6.829e-01  1.150e+00  -0.594   0.5532  
## StateWashington        8.993e-02  1.753e+00   0.051   0.9591  
## StateWest Virginia    -7.573e-01  8.890e-01  -0.852   0.3951  
## StateWisconsin         4.020e-01  1.478e+00   0.272   0.7859  
## StateWyoming          -2.834e+00  2.434e+00  -1.165   0.2452  
## TypeEarthquake         1.446e+00  2.973e+00   0.486   0.6272  
## TypeFire               1.291e+00  2.933e+00   0.440   0.6603  
## TypeFlood             -2.039e-01  2.692e+00  -0.076   0.9397  
## TypeHurricane         -4.446e-01  2.707e+00  -0.164   0.8697  
## TypeOther              3.339e+00  3.645e+00   0.916   0.3606  
## TypeSevere Ice Storm  -9.198e-01  3.163e+00  -0.291   0.7715  
## TypeSevere Storm      -5.954e-01  2.674e+00  -0.223   0.8240  
## TypeTornado           -2.885e+00  2.785e+00  -1.036   0.3014  
## TypeVolcanic Eruption  7.260e+00  3.827e+00   1.897   0.0590 .
## Population             1.587e-07  2.442e-07   0.650   0.5165  
## Duration               1.327e-02  6.997e-03   1.897   0.0590 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.302 on 252 degrees of freedom
## Multiple R-squared:  0.2769, Adjusted R-squared:  0.1105 
## F-statistic: 1.664 on 58 and 252 DF,  p-value: 0.004167

Income per capita (log) has a significant effect on damage, as does the type of disaster. The independent variable Duration has only a small effect; however, we will include it in the final model. In a next step, the different levels of the variable Type are tested against each other to see whether their means differ significantly. As can be seen, only the Volcanic Eruption-Tornado pairing shows a significant difference (adjusted p = 0.032); no other pairing does.

drop1(model.full, test = 'F')
tukey.test <- TukeyHSD(aov(model.full), "Type")
tukey.test
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = model.full)
## 
## $Type
##                                           diff         lwr        upr     p adj
## Earthquake-Dam/Levee Break          0.27782593  -7.6599770  8.2156288 1.0000000
## Fire-Dam/Levee Break                1.02685790  -6.9109450  8.9646608 0.9999942
## Flood-Dam/Levee Break              -0.21366735  -7.6731620  7.2458273 1.0000000
## Hurricane-Dam/Levee Break          -0.36713257  -7.7761027  7.0418376 1.0000000
## Other-Dam/Levee Break               3.40821464  -6.9848090 13.8012383 0.9889658
## Severe Ice Storm-Dam/Levee Break   -0.91904959  -9.9196721  8.0815729 0.9999993
## Severe Storm-Dam/Levee Break       -0.44302424  -7.8113158  6.9252673 1.0000000
## Tornado-Dam/Levee Break            -2.58608373 -10.2937563  5.1215889 0.9870266
## Volcanic Eruption-Dam/Levee Break   5.48069130  -4.9123323 15.8737149 0.8038948
## Fire-Earthquake                     0.74903197  -3.4939022  4.9919661 0.9999166
## Flood-Earthquake                   -0.49149327  -3.7530627  2.7700761 0.9999783
## Hurricane-Earthquake               -0.64495850  -3.7892571  2.4993400 0.9997090
## Other-Earthquake                    3.13038871  -4.8074142 11.0681916 0.9615636
## Severe Ice Storm-Earthquake        -1.19687552  -7.1972905  4.8035395 0.9997689
## Severe Storm-Earthquake            -0.72085016  -3.7680612  2.3263608 0.9990751
## Tornado-Earthquake                 -2.86390965  -6.6589053  0.9310860 0.3248859
## Volcanic Eruption-Earthquake        5.20286538  -2.7349375 13.1406683 0.5346014
## Flood-Fire                         -1.24052524  -4.5020947  2.0210442 0.9695765
## Hurricane-Fire                     -1.39399047  -4.5382890  1.7503081 0.9215808
## Other-Fire                          2.38135674  -5.5564462 10.3191597 0.9942204
## Severe Ice Storm-Fire              -1.94590749  -7.9463225  4.0545075 0.9898102
## Severe Storm-Fire                  -1.46988213  -4.5170931  1.5773289 0.8747251
## Tornado-Fire                       -3.61294162  -7.4079373  0.1820540 0.0769895
## Volcanic Eruption-Fire              4.45383341  -3.4839695 12.3916363 0.7402653
## Hurricane-Flood                    -0.15346523  -1.7415327  1.4346023 0.9999995
## Other-Flood                         3.62188198  -3.8376126 11.0813766 0.8703115
## Severe Ice Storm-Flood             -0.70538225  -6.0570479  4.6462834 0.9999932
## Severe Storm-Flood                 -0.22935689  -1.6153006  1.1565868 0.9999517
## Tornado-Flood                      -2.37241638  -5.0252143  0.2803816 0.1242886
## Volcanic Eruption-Flood             5.69435865  -1.7651360 13.1538533 0.3086385
## Other-Hurricane                     3.77534721  -3.6336229 11.1843173 0.8339605
## Severe Ice Storm-Hurricane         -0.55191702  -5.8329305  4.7290965 0.9999991
## Severe Storm-Hurricane             -0.07589166  -1.1573805  1.0055972 1.0000000
## Tornado-Hurricane                  -2.21895115  -4.7261638  0.2882615 0.1336638
## Volcanic Eruption-Hurricane         5.84782388  -1.5611463 13.2567940 0.2631173
## Severe Ice Storm-Other             -4.32726423 -13.3278867  4.6733583 0.8769124
## Severe Storm-Other                 -3.85123887 -11.2195304  3.5170526 0.8120337
## Tornado-Other                      -5.99429836 -13.7019710  1.7133742 0.2829326
## Volcanic Eruption-Other             2.07247667  -8.3205470 12.4655003 0.9997694
## Severe Storm-Severe Ice Storm       0.47602536  -4.7477649  5.6998156 0.9999997
## Tornado-Severe Ice Storm           -1.66703413  -7.3595276  4.0254593 0.9951703
## Volcanic Eruption-Severe Ice Storm  6.39974090  -2.6008816 15.4003634 0.4124360
## Tornado-Severe Storm               -2.14305949  -4.5273826  0.2412637 0.1199250
## Volcanic Eruption-Severe Storm      5.92371554  -1.4445760 13.2920071 0.2393133
## Volcanic Eruption-Tornado           8.06677503   0.3591024 15.7744476 0.0319788
plot(tukey.test)

In the model with only the variables that have an effect on the dependent variable, the output provides a different picture. Income per capita is now only marginally significant (p < 0.1). If income per capita increases by one percent, damage increases by about 0.84%. Looking at the disaster type, with the reference level set to “Dam/Levee Break”, only “Volcanic Eruption” is significant; we will not discuss it further because this type of disaster is rare. Only about 13% of the variance of the dependent variable (adjusted R-squared) can be explained by the three variables, so the model explains the data poorly.
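The 0.84% figure follows directly from the log-log specification; a worked check using the log.income estimate from the summary below:

```r
b <- 0.837698                 # log.income coefficient from the reduced model
# In a log-log model, a 1% increase in income multiplies damage by 1.01^b
pct_damage_change <- (1.01^b - 1) * 100
round(pct_damage_change, 2)   # approximately 0.84
```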

model.right <- lm(log.damage ~ log.income +  Type+ Duration, data = na.omit(df.model))
summary(model.right)
## 
## Call:
## lm(formula = log.damage ~ log.income + Type + Duration, data = na.omit(df.model))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.960 -1.570 -0.078  1.551  5.864 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)  
## (Intercept)            1.750439   5.472425   0.320   0.7493  
## log.income             0.837698   0.497469   1.684   0.0932 .
## TypeEarthquake         1.007025   2.474533   0.407   0.6843  
## TypeFire               1.504898   2.482205   0.606   0.5448  
## TypeFlood             -0.929765   2.316326  -0.401   0.6884  
## TypeHurricane         -0.813160   2.297781  -0.354   0.7237  
## TypeOther              3.478853   3.221237   1.080   0.2810  
## TypeSevere Ice Storm  -1.994636   2.789519  -0.715   0.4751  
## TypeSevere Storm      -1.207145   2.284732  -0.528   0.5976  
## TypeTornado           -3.328925   2.388800  -1.394   0.1645  
## TypeVolcanic Eruption  6.932481   3.280384   2.113   0.0354 *
## Duration               0.009687   0.006022   1.609   0.1088  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.278 on 299 degrees of freedom
## Multiple R-squared:  0.1602, Adjusted R-squared:  0.1293 
## F-statistic: 5.184 on 11 and 299 DF,  p-value: 1.822e-07
## log.income 
##    2.31104
## TypeVolcanic Eruption 
##              102403.4

Fitted Values

Fitted values indicate how well the model fits (goodness of fit). They are the values of y that the model predicts for each observation in our data, based on the independent variable (here: income per capita) and the estimated regression coefficients. The fitted values are centered in the middle.

library(car)
# fitted.model was not defined above; compute the fitted values for all rows so
# the vector stays aligned with df.model, then keep only rows with a prediction
fitted.model <- predict(model, newdata = df.model)
df.sub <- df.model[!is.na(fitted.model), ]
df.sub$fitted.model <- fitted.model[!is.na(fitted.model)]
# Create plot with both observed and fitted values
plot(log.damage ~ log.income, data = df.sub,
     main = "Model 'Log. Damage and Log. Income'",
     col = "pink4")
points(fitted.model ~ log.income,
       col = "lightblue4",
       pch = 19,
       data = df.sub)
abline(model, col = "red4")
legend("topright", 
       legend = c("Observed data", "Fitted values", "Model"),
       col = c("pink4", "lightblue4", "red4"), 
       pch = c(1, 19, NA), 
       lty = c(NA, NA, 1))

All in all, we can see that the model explains the response variable damage poorly. One might consider additional meteorological variables, such as wind speed, to predict this response variable.

Chapter of choice

For any analysis, some kind of outlier detection helps to reveal what structure the data has. To achieve this, we use the k-means method: several variables are given, and groups are formed from them. The goal is to have large differences between the formed groups, while the differences within each group should be small.
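The idea can be illustrated on simulated data: kmeans() minimizes the within-cluster sum of squares, so the within-group variation ends up small relative to the total variation.

```r
set.seed(42)
# Two well-separated point clouds in two dimensions (20 points each)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 10)

# Share of total variation left within the clusters; small for good clusterings
km$tot.withinss / km$totss
```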

#Dataframe
df.outlier.0 <- data.frame(Approved = df.prep.5$Approved, 
                              Repair = df.prep.5$Repair,
                           income = df.prep.5$IncomeCapita,
                           population = df.prep.5$Population,
                           damage = df.prep.5$Damage)

df.outlier.1 <- data.frame(repair = df.outlier.0$Repair,
                           income = df.outlier.0$income)
plot(df.outlier.1)

Clustering the two variables income and repair

When we scale the data we get one big cluster. After the initial clustering, the elbow plot suggests either 2 or 5 clusters. Further analysis with cluster plots and a silhouette plot indicates that two clusters are the better solution.

#prep the data
datas <- scale(df.outlier.1, center = FALSE, scale = TRUE)
datas <- na.omit(datas)
boxplot(datas)

plot(datas)

dists <- dist(datas)
km <- kmeans(datas, centers = 3, nstart = 10)
groups_km <- km$cluster
groups_km
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [186] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [223] 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [260] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [297] 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [334] 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [371] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [408] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [445] 3
cluster_size <- cbind(sum(groups_km == 1), sum(groups_km == 2), 
                      sum(groups_km == 3))
cluster_size
##      [,1] [,2] [,3]
## [1,]    1    4  440
plot(datas, pch = groups_km, col=groups_km, lwd=2)
legend("topright", legend = 1:3, pch = 1:3, col=1:3, bty="n")

reps <- rep(0, 6)
for (i in 1:6) reps[i] <- sum(kmeans(datas, centers = i, nstart = 20)$withinss)
par(mfrow = c(1,1))
plot(1:6, reps, type = "b", xlab = "Number of groups", ylab = "Sum of squares")
text(3, 120, "Elbow point signifies how many \ngroups there could be", col = "red", cex = 1.08, pos = 4)

km2 <- kmeans(datas, centers = 2, nstart = 10)
groups_km2 <- km2$cluster
groups_km2
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [223] 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [297] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [334] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [371] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [408] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [445] 2
cluster_size2 <- cbind(sum(groups_km2 == 1), sum(groups_km2 == 2))
cluster_size2
##      [,1] [,2]
## [1,]    2  443
plot(datas, pch = groups_km2, col=groups_km2, lwd=2)
legend("topright", legend = 1:2, pch = 1:2, col=1:2, bty="n")

Three clusters achieve an average silhouette width of 0.93. Compared to the silhouette plot for five clusters further below, this solution performs better.

plot(silhouette(groups_km, dists), color='red3', main = "")

km5 <- kmeans(datas, centers = 5, nstart = 10)
groups_km5 <- km5$cluster
groups_km5
##   [1] 5 5 5 5 5 3 2 3 5 5 5 5 3 3 5 2 5 3 5 5 5 2 5 5 5 5 5 5 3 3 3 5 3 5 5 5 5
##  [38] 5 5 3 3 5 5 5 3 3 5 5 5 5 3 3 5 5 3 5 5 5 3 1 3 3 3 5 5 5 3 5 5 3 3 5 3 5
##  [75] 3 5 5 3 5 5 5 5 5 3 3 2 3 5 3 2 5 3 5 5 5 5 3 5 5 3 2 3 5 5 5 3 5 3 3 3 3
## [112] 5 5 5 3 5 3 3 3 5 5 3 3 3 3 5 5 3 3 5 5 3 5 5 5 3 5 5 5 2 3 3 3 5 3 3 5 5
## [149] 3 5 5 3 5 5 3 5 5 5 5 5 5 5 3 3 3 5 5 3 5 3 5 5 3 5 5 5 5 5 5 3 5 5 3 3 3
## [186] 3 5 5 5 2 3 5 5 5 5 5 5 5 5 3 5 5 5 5 5 5 5 5 5 5 3 5 5 5 5 3 3 5 5 2 3 5
## [223] 5 5 2 5 5 3 5 3 3 1 3 3 3 3 5 3 5 5 5 3 5 2 5 5 3 5 5 5 5 5 3 5 3 5 5 5 3
## [260] 5 5 3 3 5 3 3 3 5 3 5 3 3 5 5 5 4 3 5 5 3 5 5 5 5 5 5 5 5 3 5 5 5 5 5 5 5
## [297] 1 5 5 5 5 5 5 3 5 5 5 5 5 5 2 3 5 5 5 3 5 5 5 5 5 3 5 5 5 5 5 5 5 5 5 5 5
## [334] 5 5 5 5 2 1 3 2 5 5 5 5 3 5 5 3 3 3 5 5 5 3 3 3 5 5 3 3 3 2 3 3 2 5 5 5 5
## [371] 3 5 5 5 3 3 3 3 3 3 5 3 3 3 5 3 5 5 3 3 3 3 5 3 5 5 3 3 5 5 5 5 5 5 5 5 5
## [408] 3 5 5 5 5 5 3 5 3 3 5 5 3 3 5 5 5 5 5 5 5 5 5 5 5 5 3 5 5 5 5 3 5 5 5 5 5
## [445] 5
cluster_size5 <- cbind(sum(groups_km5 == 1), sum(groups_km5 == 2), 
                      sum(groups_km5 == 3), sum(groups_km5 == 4),sum(groups_km5 == 5))
cluster_size5
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    4   16  143    1  281
plot(datas, pch = groups_km5, col=groups_km5, lwd=2)
legend("topright", legend = 1:5, pch = 1:5, col=1:5, bty="n")

The silhouette plot implies that the clustering works moderately well for 5 clusters.

plot(silhouette(groups_km5, dists))

dc <- dist(datas, method = "euclidean")

cc <- hclust(dc, method = "complete")
plot(cc,cex = 0.3, hang = -1)

Chapter of choice (Martina)

To make up for the time I missed during the block week on Monday afternoon, I was given the task to produce an additional chapter of choice using an R package that hasn’t been used before. Since we have date values in our data set, I thought it would be interesting to try to show a forecast. I started by generating a new data frame containing the frequency with which “severe storms” happened per month between 2002 and 2023. Since (luckily!) such a disaster did not happen every month, I completed the data frame by adding every month in the given time period and filling in “0” where there was no event. To generate the time series plot, I loaded the packages “forecast” and “zoo”; the zoo package was needed because R otherwise converted the date column into a different index before using it for the x-axis, whereas zoo allows specifying the original date column as the index.

library(forecast)
library(zoo)
Sstorms <- subset(df.prep.5, Type == "Severe Storm")
Sstorms_timeonly <- Sstorms[,c("Disaster#", "DisasterBegin")]

#calculating the frequency of severe storms per month
Sstorms_month_year <- format(Sstorms_timeonly$DisasterBegin, "%Y-%m")
freq_table <- table(Sstorms_month_year)

#generating the data frame
frequency <- data.frame(Sstorms_month_year = names(freq_table), frequency = as.numeric(freq_table))

#converting the date format to a format R can recognize for the time series plot
frequency$Sstorms_month_year <- as.Date(paste0(frequency$Sstorms_month_year, "-01"), format = "%Y-%m-%d")

##filling in all months without a severe storm happening and adding 0 values 
# Create a sequence of dates covering the entire range
start_date <- min(frequency$Sstorms_month_year)
end_date <- max(frequency$Sstorms_month_year)
all_dates <- seq(start_date, end_date, by = "month")

# Create a new data frame with missing months filled with 0 values
dates_df <- data.frame(date = all_dates)

colnames(frequency) <- c("date", "frequency")

# Merge the new data frame with the original data frame
dates_merged <- merge(frequency, dates_df, by = "date", all = TRUE)

# Replace missing values with 0
dates_merged[is.na(dates_merged$frequency), "frequency"] <- 0

#R did not use the date column for the x-axis but converted it into a numeric
#index first. Therefore, the zoo package is used to create the time series
#object with the date column as index (the earlier duplicate assignment to
#"ts" also masked the base function ts(), so only one object is kept)

frequency_tszoo <- zoo(dates_merged$frequency, order.by = dates_merged$date)

# Plot the time series with the original dates as the x-axis
plot(frequency_tszoo, type = "l", xlab = "Date", ylab = "Frequency", main = "Frequency of severe storms / month")

The time series plot indicates a trend toward a decreasing frequency of severe storms in later years, but no seasonal component can be clearly identified. Accordingly, trying to decompose the data was not successful and resulted in the error message that the time series is not periodic or has fewer than two periods.

#stl <- stl(frequency_tszoo, na.action = "omit", s.window = 12)
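The decomposition error stems from the zoo object carrying no periodicity information. Converting the monthly counts into a regular `ts` with `frequency = 12` would make `stl()` applicable; a sketch (the start of `c(2002, 1)` is an assumption taken from the analysis time frame and would need to match the first month in `dates_merged$date`):

```r
# Regular monthly time series; start must match the first month in dates_merged
frequency_ts <- ts(dates_merged$frequency, start = c(2002, 1), frequency = 12)

# STL decomposition into trend, seasonal and remainder components
stl_fit <- stl(frequency_ts, s.window = "periodic")
plot(stl_fit)
```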

In another attempt, I looked at the autocorrelation and partial autocorrelation functions to get an indication of whether it could make sense to fit a linear model such as an auto-regressive (AR) model for the forecast. The acf resulted in the following plot, whereas the pacf resulted in an error message that the maximal lag must be at least one.

acf(frequency_tszoo, na.action = na.pass)

#library(ggfortify)
#ggPacf(frequency_tszoo)
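The `pacf()` error can usually be avoided by passing the plain numeric vector instead of the zoo object, so that the default maximal lag is computed from the series length; a sketch:

```r
# Partial autocorrelation on the underlying numeric vector of the zoo object
pacf(coredata(frequency_tszoo), na.action = na.pass)
```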

By looking at the time series plot, I would also argue that an accurate forecast is quite hard to achieve, since the frequency dropped considerably in the past few years. We would therefore need additional data to confirm that this is a long-lasting drop before basing a forecast on it.

PS: Just to try it out, I fitted an AR(1) model on the data as a whole and predicted the next 100 data points, which resulted in a nonsensical horizontal line:

f_fit <- arima(frequency_tszoo, order = c(1, 0, 0))
f_pred <- predict(f_fit, n.ahead = 100)  # named f_pred to avoid masking the forecast package
plot(frequency_tszoo, 
     xlim = c(18000, 19737),
     xlab = "Time",
     ylab = "Frequency")
lines(f_pred$pred, lwd = 2, col = "red")
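As an alternative to the hand-fitted AR(1), the forecast package can select an ARIMA order automatically and plot prediction intervals; a sketch, assuming the monthly counts in `dates_merged$frequency` and a start of 2002 (an assumption that would need to match the data):

```r
library(forecast)

# Regular monthly series built from the monthly counts
fc_ts <- ts(dates_merged$frequency, start = c(2002, 1), frequency = 12)

# Let auto.arima pick the order, then forecast 24 months ahead
fit <- auto.arima(fc_ts)
plot(forecast(fit, h = 24))
```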

Summary & Conclusion

In this project, we focused on exploratory data analysis and different ways of displaying data. As is often the case for amounts, count data, etc., our variables like amount of money tended to be right-skewed, which is why we mostly worked with log-transformed data. The log transformation helps, for example, to generate figures that are easier to interpret in terms of detecting differences. However, the caveat is that the “real numbers” cannot be directly read from the graph. Furthermore, the log transformation “shifts” right-skewed data toward a more Gaussian distribution, which is important when the goal is to use statistical analysis methods or machine learning models, since many of these methods assume an underlying Gaussian distribution of the data. To generate plots, we quickly started to like the ggplot package. The explanation in the block seminar that ggplot uses a layered approach was very helpful and made using it and starting to play around much easier and more fun. The session on R Markdown was also incredibly helpful, and overall we are taking a lot of great tips and tricks with us that will for sure make our daily coding life more enjoyable :-)